Selective CRLA based Layout Analysis and Text Region Extraction from Low Quality Document Images

Authors

  • Jie Xi
  • Jianming Hu
Abstract

This paper aims to detect textual regions in noisy, low-resolution newspaper images by separating them from graphical regions, using a Selective CRLA scheme and statistical properties of text. A bottom-up approach is adopted: the Selective Constrained Run Length Algorithm (CRLA) is applied to obtain the layout, and a region-growing method applied to its output segments the homogeneous regions. Statistical properties such as black run length and transition rate are proposed to extract the textual regions. The proposed method can be used to locate text in groups of newspaper images with multiple or complicated page layouts. After encouraging initial results, the method was tested on a considerable number of newspaper images with different layout structures, and promising results were obtained. Its major application is in digital libraries for OCR, where the quality of the information can vary with the age of the scanned paper.
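The abstract names three building blocks without giving details: run-length smearing, black run length, and transition rate. The following is a minimal sketch of the classical versions of these operations on a single binarized scanline (1 = black, 0 = white). It is not the paper's Selective CRLA: the function names, the `max_gap` threshold, and the exact definition of transition rate used here are assumptions made for illustration.

```python
from typing import List


def smear_row(row: List[int], max_gap: int) -> List[int]:
    """Classical horizontal run-length smearing (illustrative, not Selective CRLA).

    White runs of length <= max_gap that lie between two black pixels
    are filled with black; longer white runs are left untouched.
    """
    out = list(row)
    last_black = None  # index of the most recent black pixel seen
    for i, px in enumerate(row):
        if px == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= max_gap:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def black_run_lengths(row: List[int]) -> List[int]:
    """Lengths of consecutive black-pixel runs along the scanline."""
    runs = []
    count = 0
    for px in row:
        if px == 1:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs


def transition_rate(row: List[int]) -> float:
    """Fraction of adjacent pixel pairs whose values differ (assumed definition)."""
    if len(row) < 2:
        return 0.0
    transitions = sum(1 for a, b in zip(row, row[1:]) if a != b)
    return transitions / (len(row) - 1)


row = [1, 0, 0, 1, 0, 0, 0, 0, 1, 1]
print(smear_row(row, max_gap=2))   # [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
print(black_run_lengths(row))      # [1, 1, 2]
print(round(transition_rate(row), 3))  # 0.444
```

Under these assumptions, text lines would tend to show short black runs and a high transition rate, while large graphical regions would show longer runs and fewer transitions; any thresholds separating the two would have to be chosen empirically.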

Similar resources

Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. The system includes preprocessing, document segmentation, feature extraction and document classification. In preprocessing, a document image is enhanced by removing noise, binarizing it, and detecting and correcting image skew. In document segmentation, an algorith...

Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing

The Constrained Run-Length Algorithm (CRLA) is a well-known technique for page segmentation. The algorithm is very efficient at partitioning documents with Manhattan layouts but is not suited to complex page layouts, e.g. irregular graphics embedded in a text paragraph. Its main drawback is that it uses only local information during the smearing stage, which may lead to erroneous linkage of t...

Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)

Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions, both of which degrade the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...

Information Extraction from Document Images using Attention Based Layout Segmentation

The attention of a human reader and the reading speed strongly depend on the layout of a document. The term layout is used for the geometrical arrangement of document components (i.e. text, graphics and figures) on the page as well as for the typographic features of the text (i.e. font type, style, size, alignment and line spacing). Although the human visual and cognitive perceptio...

Newspaper Headlines Extraction from Microfilm Images

Automatic indexing is important for a digital library to provide digitized manuscripts of old document images and their electronic text. As an essential step in creating such a system, this paper discusses the issue of extracting headlines from old newspaper microfilms. Most research on document layout analysis has assumed relatively clean images. However, microfilm images of old newspap...

Publication date: 2009